Propensity score matching as an effective strategy for biomarker cohort design and omics data analysis

Background Analysis of omics data that contain multidimensional biological and clinical information can be complex and make it difficult to deduce significance of specific biomarker factors. Methods We explored the utility of propensity score matching (PSM), a statistical technique for minimizing confounding factors and simplifying the examination of specific factors. We tested two datasets generated from cohorts of colorectal cancer (CRC) patients, one comprised of immunohistochemical analysis of 12 protein markers in 544 CRC tissues and another consisting of RNA-seq profiles of 163 CRC cases. We examined the efficiency of PSM by comparing pre- and post-PSM analytical results. Results Unlike conventional analysis which typically compares randomized cohorts of cancer and normal tissues, PSM enabled direct comparison between patient characteristics uncovering new prognostic biomarkers. By creating optimally matched groups to minimize confounding effects, our study demonstrates that PSM enables robust extraction of significant biomarkers while requiring fewer cancer cases and smaller overall patient cohorts. Conclusion PSM may emerge as an efficient and cost-effective strategy for multiomic data analysis and clinical trial design for biomarker discovery.


Background
Omics data analysis has become increasingly popular in biomedical research and precision medicine.Due to advances in sequencing and other deep molecular profiling methods and infrastructure, a wealth of multiomic data, spanning from the genome, transcriptome, proteome to the metabolome, have been generated.Omics datasets are complex and associated with multiple biological and clinical parameters.There is currently no standard method for parsing through omics data to extract significant specific biomarkers and separating them from a multitude of possibly confounding variables.Typically, very large randomized patient cohorts are considered to be needed in disease biomarker discovery for addressing the confounding variable problem.
Propensity score matching (PSM) is a statistical matching technique that attempts to deduce the effect of a single specific factor by reducing bias due to confounding variables [1].In past studies, the assignment of a single specific factor to a subject was not random, and there were many confounding variables between the group with the specific factor and the control group without the factor.The propensity score expresses quantitatively "how likely each case is to have the specific factor."The estimated propensity score e(x i ) of each case can be calculated based on its background, i (i = 1,2,. ..,N), which could be a confounding variable.Paired cases with most similar propensity scores can be extracted from each group with or without the single specific factor [2][3][4][5].This process of homogenizing confounding factors is called PSM, and it enables to obtain extracted groups and control groups which have a difference only in the single specific biomarker to be examined.
Although PSM is a powerful statistical method, it has not been implemented into routine omics data analysis for biomarker discovery.In this study, we tested PSM in two colorectal cancer datasets to investigate its efficiency in omics analysis.One dataset was derived from immunohistochemical (IHC) protein expression of 12 protein markers in a cohort of 544 colorectal cancer (CRC) patients [6,7].The second dataset was obtained from The Cancer Genome Atlas database (TCGA) and contained total RNA-sequencing profiling of 163 CRC patients [8].We clustered the cancer cases into good or poor prognosis groups based on patient survival information, and the groups were compared to uncover prognostic protein or transcript biomarkers.We then compared the results before and after PSM implementation to evaluate the efficiency of PSM for omics data analyses and confounding variable-avoiding biomarker discovery.

Methods
The study used retrospective archives samples and was classified by the institution as minimal risk and with waiver of consent (BIDMC IRB 2023P000933).

Dataset of proteomic markers of CRC
This dataset was derived from proteomic expression profiles of 544 surgically resected CRC tissue obtained from the pathology archives of MSKCC.Clinicopathological parameters, including patient age, gender, tumor location, TNM stage, mismatch repair (MMR) status, histology, tumor differentiation, lymphovascular invasion, and perineural invasion were retrieved from medical records.This cohort comprised 367 patients who survived more than three years without recurrence (good outcome group), and 60 patients who had recurrence within three years (poor outcome group).The others did not have sufficient survival information due to short follow-up periods.
Twelve potential prognostic biomarkers (NNMT, GALNT6, SLC3A2, SLC7A5, IGF2BP3, MCM6, SERPIN B5, STAT1, NAMPT, P4HA1, DDX21, and LTBP2) were chosen based on preliminary proteomic findings from our laboratory [6,7].Expression levels of these proteins were determined by immunohistochemical (IHC) staining of tissue microarrays.The tissue microarrays were constructed from 544 formalin fixed and paraffin embedded CRC tissue specimens.Tissue sections were incubated with polyclonal antibodies specific to each protein marker (see S1 Table for details on antibodies used) and visualized with Bond Polymer Refine Detection (Leica).IHC staining scores, which correspond to protein expression levels in cancer tissue, were determined independently by two pathologists without knowledge of the patients' clinical information.
For seven markers, IHC staining intensity of individual tumor cells was determined and assigned intensities of 0, 1+, 2+, or 3+, and the total weighted IHC score of a sample slide was calculated by multiplying the expression intensity of individual tumor areas (score 0-3) by their relative contribution (0-100%) to total tumor area.The total weighted IHC scores thereby had a range of 0-300 for these protein biomarkers.For NNMT, GALNT6, or MCM6, each tissue section was scored by counting the number of cancer cells staining positively for the protein biomarker (staining intensity �1+) relative to the total number of evaluated cancer cells.A minimum of 500 cancer cells were evaluated per tissue sample, and the IHC scores had a range of 0-100.For DDX21 or LTBP2, IHC stains were scored binomially as negative or positive (score 0 or 1+).

Random forest analysis
IHC scorings were compared between the good and poor prognosis groups.A random forest analysis was performed to detect prognostic IHC markers using the R (version 4.2) packages "randomForestSRC" (version 3.3.2) and "ggRandomForests" (version 2.2.1) downloaded from cran.rstudio.com.The analysis included recurrence-free survival data, clinicopathological features, and IHC scorings.The number of trees used was set to 100,000.Minimal depth value and variable importance were utilized to detect significant prognostic factors.Additionally, the significant prognostic IHCs which were selected by random forest analysis were compared to the results of the original comparisons.

Statistical comparison
Clinicopathological values were compared using a Student t-test or a chi-squared test if the values were binary.The comparisons of IHC scorings were also conducted using a Student t-test or a chi-squared test.

Dataset of CRC transcriptome
The original RNA-seq dataset used in this study is available using the "GDCquery" function in the package "TCGAbiolinks" (version 2.18.0) downloaded from bioconductor.org(R version 4.2).The "project" names used are "TCGA-COAD" and "TCGA-READ".The "data.category" name used is "Transcriptome Profiling" and the "data.type"used is "Gene Expression Quantification" for RNA-seq data.The "data.category" name used is "Clinical" for clinical information.
This dataset contains mRNA sequencing counts for 60,660 genes from 163 CRC patients [8].The data were obtained from TCGA-COAD and TCGA-READ using the Bioconductor package in R [9,10].In this CRC cohort, 130 patients survived more than three years after sample collection and were considered in the good prognosis group, and 33 patients died within three years and were considered in the poor prognosis group.The dataset included clinicopathological information such as age, gender, TNM stage, MMR status, histology, residual tumor status, venous invasion, lymphovascular invasion, and perineural invasion.

Differential gene expression analysis
Differential expression analysis was performed to identify genes that were significantly associated with outcome separating good and poor prognosis groups.Differential transcriptome expression analysis of was conducted using the R (version 4.2) packages "limma" (version 3.56.2) and "edgeR" (version 3.42.4)downloaded from bioconductor.org[11][12][13].Transcripts with counts per million (CPM) <15 were filtered out in order to exclude low-expression genes, and the dataset was normalized.The dispersion of gene expression values was estimated [14], and statistical tests of fold change were computed to compare the poor prognosis group to the good prognosis group.Gene transcript changes with a false discovery rate (FDR) less than 0.05 were considered significant.

Propensity score matching (PSM)
A flow chart of how PSM was performed in this study is shown in Fig 1 .Estimated propensity scores of each case e(X i ) were calculated using linear logistic regression: b 0 : the intercept b i : the regression coefficient x i : the variables of each case X i : the specific variable to investigate and covariates Z i = 1: case in a poor prognosis group Z i = 0: case in a good prognosis group The propensity score of each case was computed from the collected clinicopathological features using the "glm" function (generalized linear model) and the "predict" function, both of which are part of the basic R package "stats" (version 4.4.0).
Cases in the group with the smaller number of patients (in this study, the poor prognosis group) were sorted by their propensity scores.The sorting was done from smallest to largest propensity score, and, for each case, a paired case with most similar propensity score was selected from the good prognosis group.If the absolute value of difference between two propensity scores was less than 0.2, the case pair was included and the matched case removed from future matching rounds.If the absolute value of difference was greater than 0.2, the original case was excluded and the matched case retained for next matching rounds.This process resulted in the creation of two extracted groups that had the same number of cases without duplicates, and the number of cases was smaller than the original dataset.

Prognostic proteomic biomarkers of CRC
The proteomic marker dataset of CRC cohort consisted of 367 cases in the good prognosis group and 60 cases in the poor prognosis group (Table 1).Propensity scores were computed for each case, and the resulting distributions of the propensity scores are shown in histograms and density plots (Fig 2).After PSM, group differences of confounding factors, such as age, gender, TNM stage, lymphatic invasion positivity, and perineural invasion positivity, were eliminated.Two paired groups of 58 cases each were created after PSM, a significant reduction in case numbers in both the good and poor prognosis groups when compared to the original data before PSM (Table 1).
Comparisons of protein marker expression as measured by IHC scores between the good and poor prognosis groups before and after PSM are shown in Table 2.In the analysis before PSM, three protein biomarkers (SLC7A5, SLC3A2, and STAT1) showed statistically significant differences between the good and poor prognosis groups.However, only STAT1 showed significant difference between the two groups after cohorts were minimized for confounding variables by PSM.
To validate the above findings, we also conducted random forest analyses on the dataset to identify significant prognostic factors (Fig 3).In addition to clinicopathological features, several markers such as STAT1, NNMT, SLC3A2, and MCM6 were found to be significant prognostic factors based on both minimal depth and variable importance rankings.Except for clinicopathological features such as lymphatic invasion and TNM stage, STAT1 was the most significant prognostic factor based on both types of rankings from random forest analysis.

Prognostic transcriptomic biomarkers of CRC
The TCGA's CRC RNA-seq dataset consisted of 130 cases in the good prognosis group and 33 cases in the poor prognosis group (Table 3).The propensity score of each case was calculated, and the distributions of these scores were shown in histograms and density plots (Fig 4).Before PSM, the good and poor prognosis groups were significantly impacted by six confounding clinicopathological features, including TNM staging, mucinous histology, residual tumor status, venous invasion, lymphatic invasion, and perineural invasion.These confounding features were successfully eliminated by PSM, resulting in two paired groups of 28 cases each with no significant differentiating clinical factors (Table 3).Without PSM implementation, total mRNA sequencing of the original 130 good prognosis cases and 33 poor prognosis cases yielded 13,294 genes (21.9% of all) with counts per million (CPM) greater than 15.Differential gene expression analysis identified 402 genes with false discovery rate less than 0.05, i.e., significantly differentially expressed between the good and After PSM and reducing the cases to only 28 in each group, differential expression analysis identified 12,460 genes (20.5% of all) with CPM greater than 15.122 differential genes were observed with false discovery rate less than 0.05.The 122 significant genes included 109 upregulated and 13 downregulated genes associated with poor prognosis.
Comparing the results before and after PSM, 29 significantly differentially expressed genes were only identified after PSM, while 93 significant genes were identified both without and with PSM analysis (Fig 5).The 29 newly uncovered CRC outcome-differential genes are listed in Table 4. Additionally, the 93 genes selected in common across both comparisons are provided as S2 Table.
Among these 29 genes identified only after PSM are numerous well-known cancer-related markers, e.g., ERBB2, MALAT1, or MUC5AC.A literature search revealed that at least 15 of these markers have been suggested as potential diagnostic or predictive markers for CRC (see refences listed for each potential marker in Table 4).More extensive investigation of these markers and their roles in CRC will be warranted in future studies.

Discussion
Propensity score matching (PSM) is a powerful statistical tool for controlling confounding factors in analyses of complex omics biomarker cohorts.In this study, we demonstrated a PSM strategy for creating two well-balanced groups that mimic the original study cohort but with significantly fewer cases or patients required while eliminating outcome-confounding clinical differences between the original cohorts.We created a stepwise matching protocol in which the cases were matched from the one with the lowest to the highest propensity score (Fig 1).This ascending sort should be better than a descending sort.As seen in our two datasets, cases with lower propensity scores are easier to match (Figs 2 and 4).If a descending sort were used, the paired case would consistently have a lower propensity score, and the covariates would be more likely to remain after matching.Therefore, we predicted that an ascending sort would result in a better matching outcome.PSM is a cost-effective analytical method for omics data analyses and biomarker discovery.Obtaining large omics expression dataset is typically expensive, so cost is often a significant concern for research.By implementing PSM before the experimental process, high-quality case series can be created from a much smaller number of cases.This approach not only provides higher-quality analytical results but also lowers research costs.Additionally, PSM enables direct comparisons between patients.In past omics data analyses, the common method of comparison was between cancer and normal tissue to uncover cancer biomarkers.However, this method might not be appropriate for detecting certain cancer mechanisms, such as treatment response or drug resistance.PSM enables comparisons between patients, which can help identify these mechanisms directly.
In addition to the benefit of minimizing cohort size, PSM enabled us to discover several, potentially important markers for colorectal cancer.From the proteomic marker cohort, PSM affirmed the prognostic value of STAT1 in CRC that we have reported previously [30], and the conclusion was further independently solidified by random forest analysis in this study.From the CRC transcriptome cohort, PSM allowed us to identify several well-known prognostic factors that might have been missed in traditional differential expression analysis due to confounding factors, e.g., ERBB2 (HER2), MALAT1, and MUC5AC.ERBB2 (also known as HER2, neu, and NGL) encodes a transmembrane glycoprotein belonging to the epidermal growth factor receptor family [31][32][33].ERBB2 is one of the most extensively studied molecules in cancer research [34], particularly in breast cancer [35,36].Overexpression of ERBB2 in cancer cells has been linked to metastasis, poor response to cancer treatment, and poor prognosis.In our study, the mRNA abundance of ERBB2 was found to be significantly higher in the poor-prognosis group by post-PSM analysis.However, the significance of this difference was not identifiable without PSM.MALAT1 (metastasis-associated lung adenocarcinoma transcript 1) is one of the first identified cancer-associated long noncoding RNAs.MALAT1 was originally described as a prognostic marker of lung cancer metastasis.Upregulated MALAT1 has been observed in various cancers, including CRC, and is associated with cancer metastasis, invasion, and poor prognosis [37][38][39].In this study, despite previous reports linking upregulated MALAT1 to poor prognosis, we observed a significant downregulation of MALAT1 mRNA abundance in the poor prognosis group after PSM.It is worth noting that our study includes a high proportion of advanced-stage cases, with 35.7% of cases in TNM stage IV and 33.9% in stage III.Previous investigations into the expression level of MALAT1 have mainly focused on early-stage patients, and there are limited reports on the expression of MALAT1 in advanced stage patients.
MUC5AC is one of the mucins and a high molecular weight glycoprotein.While it is not expressed in normal colonic mucosa, it is expressed during CRC progression [40,41].However, the function of dysregulated MUC5AC expression has not been well characterized.A meta-analysis by Li et al. found that a high level of MUC5AC was associated with an improved prognosis [42].Consistent with this, the mRNA abundance of MUC5AC was significantly decreased in the poor-prognosis group in our study.It is possible that MUC5AC plays a disease-modifying function in some CRC patients.
PSM also allowed us to discover several lesser studied or previously unknown biomarkers of CRC.For example, a MYH7B coding variant was discovered as one of the eight novel variants associated with CRC risk in a Swedish population from genome-wide association studies of 1,515 CRC patients and 12,108 controls [43].MYH7B association with CRC risk was also reported in a gene expression prediction model of a large cohort of CRC cases that included 58,131 CRC cases and 67,347 controls of European ancestry [26].By applying PSM, we identified the significance of MYH7B in CRC from a much smaller cohort than these two studies.Although further research is needed to confirm many of the potential new markers identified in our study, our results demonstrate the applicability and effectiveness of PSM for omics data analyses.PSM analysis has some well-known limitations.It is crucial to list all possible confounding features without omission because the confounding factors that PSM can eliminate are only the ones that are already known.If unknown confounding features exist, the PSM analysis may not work correctly [44].Additionally, PSM reduces the number of cases, and the result of post-PSM analysis may contain type 2 errors [45,46].In the differential expression analysis of RNA-seq dataset, this explains why there were significantly fewer genes in the analysis after PSM (122 genes after PSM vs. 402 genes before PSM).There is a possibility of overlooking some significant genes that should be focused on due to type 2 errors.This problem could be addressed by collecting a greater number of cases at the beginning.Using 2-to-1 matching in PSM can indeed be a useful arrangement to minimize the problem of type 2 errors [47,48].In this approach, two cases with most similar propensity scores are matched to one case in the opposite group.This can help to increase the number of cases while still maintaining balance in the covariates between the two groups.For the CRC proteomic marker dataset, we used this approach to match a good prognosis group of 104 cases to a poor prognosis group of 52 cases (S1 Fig, S2 and S3 Tables) with overall results very similar to 1-to-1 PSM.This arrangement can be useful in some datasets where increasing the number of cases is important while maintaining a balance of covariates.

Conclusion
Propensity score matching (PSM) is a valuable and cost-effective method in cohort studies and omics data analyses as it allows researchers to create comparable groups with homogenized backgrounds using fewer cases.By reducing the impact of confounding factors, PSM can improve the accuracy and reliability of biomarker discovery.

Fig 1 .
Fig 1. Flow chart of propensity score matching in this study.It is crucial that the number of cases in group A is not larger than in group B. In this study, group A meant the good prognosis group, and group B meant the poor prognosis group in both datasets.https://doi.org/10.1371/journal.pone.0302109.g001

Fig 3 .
Fig 3. Random forest rankings of prognostic factors in the CRC proteomic marker dataset.(A) Ranking of variable importance (VIMP).The blue bars represent positive values of VIMP, indicating that the corresponding factor is positively associated with prognostic prediction.While the red bars represent negative values of VIMP, indicating that the factor is negatively associated with prognostic prediction.(B) Ranking of minimal depth.The small minimal depth indicates that the factor plays an important role in prognostic prediction.The vertical dashed line indicates the minimal depth threshold where smaller minimal depth values indicate higher importance and larger indicate lower importance as calculated by the "gg_minimal_depth" function of the "ggRandomForests" R package (version 4.7-1.1).(C) The combination of variable importance (VIMP) and minimal depth.The blue dots represent positive values of VIMP, while red dots represent negative values of VIMP.The threshold represented by the vertical red dashed line indicates VIMP = 0.The threshold represented by the horizontal red dashed line is equal to (B).https://doi.org/10.1371/journal.pone.0302109.g003